Shared Versus Distributed Memory Multiprocessors*

Harry F. Jordan


ABSTRACT 

The question of whether multiprocessors should have shared or distributed memory
has attracted a great deal of attention. Some researchers argue strongly for building
distributed memory machines, while others argue just as strongly for programming shared
memory multiprocessors. A great deal of research is underway on both types of parallel
systems. This paper puts special emphasis on systems with a very large number of
processors for computation intensive tasks and considers research and implementation
trends. It appears that the two types of system will likely converge to a common form for
large scale multiprocessors.


* This work was supported in part by the National Aeronautics and Space Administration under NASA contract NAS1-18605
while the author was in residence at ICASE, Mail Stop 132C, NASA Langley Research Center, Hampton, VA 23665, and in part by
the National Science Foundation under Grant NSF-GH7-17773.


What  Are  They? 

The  generic  term  parallel  processor  covers  a  wide  variety  of  architectures,  including 
SIMD  machines,  data  flow  computers  and  systolic  arrays.  The  issue  of  shared  versus  dis¬ 
tributed  memory  arises  specifically  in  connection  with  MIMD  computers  or  multiproces¬ 
sors.  These  are  sometimes  referred  to  simply  as  "parallel"  computers  to  distinguish  them 
from  vector  computers,  but  we  prefer  to  be  precise  and  call  them  multiprocessors  to 
avoid  confusion  with  the  generic  use  of  the  former  word.  Some  similar  sounding  but  dif¬ 
ferent  terms  are  often  used  in  a  confusing  way.  Multiprocessors  are  computers  capable 
of  running  multiple  instruction  streams  simultaneously  to  cooperatively  execute  a  single 
program.  Multiprogramming  is  the  sharing  of  a  computer  by  many  independent  jobs. 
They  interact  only  through  their  requests  for  the  same  resource.  Multiprocessors  can  be 
used  to  multiprogram  single  stream  (sequential)  programs.  A  process  is  a  dynamic 
instance of an instruction stream. It is a combination of code and process state, e.g.
program counter and status word. Processes are also called tasks, threads, or virtual
processors. The term multiprocessing can be ambiguous. It is either:

a) Running a program (perhaps sequential) on a multiprocessor or

b)  Running  a  program  which  consists  of  several  cooperating  processes. 

The  interest  here  is  in  the  second  meaning  of  multiprocessing.  We  want  to  gain  high 
speed  in  scientific  computation  by  breaking  the  computation  into  pieces  which  are 
independent  enough  to  be  performed  in  parallel  using  several  processes  running  on 
separate  hardware  units  but  cooperative  enough  that  they  solve  a  single  problem. 

There  are  two  basic  types  of  MIMD  or  multiprocessor  architectures,  commonly 
called shared memory and distributed memory multiprocessors. Figure 1 shows block
diagrams  of  these  two  types,  which  are  distinguished  by  the  way  in  which  values  com¬ 
puted  by  one  processor  reach  another  processor.  Since  architectures  may  have  mixtures 
of shared and private memories, we use the term "fragmented" to indicate lack of any
shared memory. Mixing memories private to specific processors with shared memory in
a system may well yield a better architecture, but the issues can be discussed easily with
respect to the two extremes: fully shared memory and fragmented memory.

Figure 1: Shared and distributed memory multiprocessors.

A  few  characteristics  are  commonly  used  to  distinguish  shared  and  fragmented 
memory  multiprocessors.  Starting  with  shared  memory  machines,  communication  of 
data  values  between  processors  is  by  way  of  memory,  supported  by  hardware  in  the 
memory  interface.  Interfacing  many  processors  may  lead  to  long  and  variable  memory 
latency.  Contributing  to  the  latency  is  the  fact  that  collisions  are  possible  among  refer¬ 
ences  to  memory.  As  in  uniprocessor  systems  with  memory  module  interleaving,  ran¬ 
domization  of  requests  may  be  used  to  reduce  collisions.  Distinguishing  characteristics 
of  fragmented  memory  rest  on  the  fact  that  communication  is  done  in  software  by  data 
transmission  instructions,  so  that  the  machine  level  instruction  set  has  send/receive 
instructions  as  well  as  read/write.  The  long  and  variable  latency  of  the  interconnection 
network  is  not  associated  with  the  memory  and  may  be  masked  by  software  which 
assembles and transmits long messages. Collisions of long messages are not easily
managed  by  randomization,  so  careful  management  of  communications  is  used  instead. 
The  key  question  of  how  data  values  produced  by  one  processor  reach  another  to  be  used 
by  it  as  operands  is  illustrated  in  Fig.  2. 

Figure 2: Communication of data in multiprocessors.

The organizations of Fig. 1 and the transmission mechanisms of Fig. 2 lead to a
broad brush characterization of the differences in the appearance of the two types of
architecture to a user. A shared memory multiprocessor supports communication of data
entirely by hardware in the memory interface. It requires short and uniform latency for
access  to  any  memory  cell.  The  collisions  which  are  inevitable  when  multiple  processors 
access  memory  can  be  reduced  by  randomizing  the  references,  say  by  memory  module 
interleaving.  A  fragmented  memory  switching  network  involves  software  in  data  com¬ 
munication  by  way  of  explicit  send  and  receive  instructions.  Data  items  are  packed  into 
large  messages  to  mask  long  and  variable  latency.  Since  messages  are  long,  communica¬ 
tions  scheduling  instead  of  randomization  is  used  to  reduce  collisions.  To  move  an  inter¬ 
mediate  datum  from  its  producer  to  its  consumer  a  fragmented  memory  machine  ideally 
sends  it  to  the  consumer  as  soon  as  it  is  produced,  while  a  shared  memory  machine  stores 
it  in  memory  to  be  picked  up  by  the  consumer  when  it  is  needed. 

It  can  be  seen  from  Fig.  1  that  the  switching  network  which  communicates  data 
among  processors  occupies  two  different  positions  with  respect  to  the  classical,  von  Neu¬ 
mann,  single  processor  architecture.  In  shared  memory,  it  occupies  a  position  analogous 
to  that  of  the  memory  bus  in  a  classical  architecture.  In  the  fragmented  memory  case,  it 
is  independent  of  the  processor  to  memory  connection  and  more  analogous  to  an  I/O 
interface.  The  use  of  send  and  receive  instructions  in  the  fragmented  memory  case  also 
contributes  to  the  similarity  to  an  I/O  interface.  This  memory  bus  versus  I/O  channel 
nature  of  the  position  of  the  switching  network  underlies  the  naive  characterization  of  the 
differences  between  the  two  types  of  network.  A  processor  to  memory  interconnection 
network  involves  one  word  transfers  with  reliable  transmission.  The  address  (name)  of 
the  datum  controls  a  circuit  switched  connection  with  uniform  access  time  to  any  loca¬ 
tion.  Since  a  read  has  no  knowledge  of  a  previous  write,  explicit  synchronization  is 
needed  to  control  data  sharing.  In  contrast,  a  processor  to  processor  interconnection  net¬ 
work  supports  large  block  transfers  and  error  control  protocols.  Message  switching 
routes  the  information  through  the  network  on  the  basis  of  the  receiving  processor’s 
name.  Delivery  time  varies  with  the  source  and  destination  pair,  and  the  existence  of  a 
message  at  the  receiver  provides  an  implicit  form  of  synchronization. 

From  the  user’s  perspective,  there  are  two  distinct  naive  programming  models  for 
the  two  multiprocessor  architectures.  A  fragmented  memory  machine  requires  mapping 
data  structures  across  processors  and  the  communication  of  intermediate  results  using 
send and receive. The data mapping must be available in a form which allows each
processor to determine the destinations for intermediate results which it produces. Large
message  overhead  encourages  the  user  to  gather  many  data  items  for  the  same  destination 
into  long  messages  before  transmission.  If  many  processors  transmit  simultaneously,  the 
source/destination  pairs  should  be  disjoint  and  not  cause  congestion  on  specific  paths  in 
the  network.  The  user  of  a  shared  memory  machine  sees  a  shared  address  space  and 
explicit  synchronization  instructions  to  maintain  consistency  of  shared  data.  Synchroni¬ 
zation  can  be  based  on  program  control  structures  or  associated  with  the  data  whose  shar¬ 
ing  is  being  synchronized.  There  is  no  reason  to  aggregate  intermediate  results  unless 
synchronization  overhead  is  unusually  large.  Large  synchronization  overhead  leads  to  a 
programming  style  which  uses  one  synchronization  to  satisfy  many  write  before  read 
dependencies  at  once.  Better  performance  can  result  from  avoiding  memory  "hot  spots" 
by  randomizing  references  so  that  no  specific  memory  module  is  referenced  simultane¬ 
ously  by  many  processors. 
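
The two naive programming models can be contrasted in a short sketch. It is only an illustration under stated assumptions: Python threads stand in for processors, a `queue.Queue` stands in for the processor to processor network, and a `threading.Event` stands in for an explicit synchronization instruction. The function names are hypothetical and do not correspond to any machine discussed here.

```python
import queue
import threading


def _run(*targets):
    """Start one thread per target and wait for all of them."""
    threads = [threading.Thread(target=t) for t in targets]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def fragmented_model(values):
    """Producer sends one aggregated message; its arrival synchronizes."""
    channel = queue.Queue()   # stand-in for the interconnection network
    result = []

    def producer():
        # Gather many intermediate results into a single long message.
        channel.put([v * v for v in values])

    def consumer():
        # Blocking receive: the existence of the message is the synchronization.
        result.append(sum(channel.get()))

    _run(consumer, producer)
    return result[0]


def shared_model(values):
    """Producer writes shared cells; an explicit flag guards the reads."""
    shared = [0] * len(values)   # stand-in for the shared address space
    ready = threading.Event()    # explicit synchronization
    result = []

    def producer():
        for i, v in enumerate(values):
            shared[i] = v * v    # ordinary one word writes
        ready.set()              # one synchronization covers many writes

    def consumer():
        ready.wait()             # a read alone knows nothing of earlier writes
        result.append(sum(shared))

    _run(consumer, producer)
    return result[0]
```

In the second function a single `ready.set()` satisfies all of the write before read dependencies at once, which is the style the text associates with large synchronization overhead.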

Why it Isn’t That Simple

The  naive  views  of  the  hardware  characteristics  and  programming  styles  for  shared 
and  fragmented  memory  multiprocessors  just  presented  are  oversimplified  for  several 
reasons. First, as already mentioned, shared and private memories can be mixed in a
single architecture, as shown in Fig. 3. This corresponds to real aspects of multiprocessor
programs,  where  some  data  is  conceptually  private  to  the  processor  doing  an  individual 
piece  of  work.  The  program,  while  normally  shared  by  processors,  is  read  only  for  each 
and  should  be  placed  in  a  private  memory,  if  only  for  caching  purposes.  The  stack  gen¬ 
erated  by  most  compilers  normally  contains  only  private  data  and  need  not  be  in  shared 
memory.  In  addition,  analysis  done  by  many  parallel  compilers  identifies  some  shared 
data  as  read  only  and  thus  cachable  in  private  memory.  Some  multiprocessors  share 
memories among some, but not all, processors. Examples are the PAX[1] and
DIRMU[2] computers. These machines move intermediate data by having its producer
place  it  in  the  correct  memory  and  its  consumer  retrieve  it  from  there.  The  transmission 
may  be  assisted  by  other  processors  if  producer  and  consumer  do  not  share  a  memory. 

Not only may a multiprocessor mix shared and private memories, but the same
memory  structure  may  have  different  appearances  when  viewed  at  different  system  lev¬ 
els.  An  important  early  multiprocessor  was  Cm*[3],  built  at  Carnegie  Mellon  University. 
An  abbreviated  block  diagram  of  the  architecture  is  shown  in  Fig.  4.  Processors  were 
attached by way of a local bus to memories and possibly I/O devices to form computer
modules. Several computer modules were linked into a cluster by a cluster bus.
Processors could access the memory of other processors using the cluster bus. Processors in
different clusters communicated through interconnected mapping controllers, called
K.maps. The name K.map and some of the behavior of Cm* are easier to understand in
light of the fact that the PDP-11 had a very small physical address, so that address
mapping was essential to accessing any large physical memory, shared or not.

Notation: P - processor; M - memory; S - switch; K - controller; T - transducer (I/O device).

Figure 3: Shared plus private memory architecture.

Figure 4: Architecture of the Cm* multiprocessor.

Not  only  does  Cm*  illustrate  a  mixture  of  shared  and  fragmented  memory  ideas,  but 
there  are  three  answers  to  the  question  of  whether  Cm*  is  a  shared  or  fragmented 
memory  multiprocessor.  At  the  microcode  level  in  the  K.map,  there  are  explicit  send  and 
receive  instructions  and  message  passing  software,  thus  making  the  Cm*  appear  to  be  a 
fragmented memory machine. At the PDP-11 instruction set level, the machine has
shared memory. There were no send and receive instructions, and any memory cell
could be accessed by any processor. The page containing the memory address had to be
mapped into the processor’s address space, but as mentioned, this was a standard
mechanism for the PDP-11. A third answer to the question appeared at the level of
programs running under an operating system. Two operating systems were built for Cm*.
The  processes  which  these  operating  systems  supported  were  not  allowed  to  share  any 
memory.  They  communicated  through  operating  system  calls  to  pass  messages  between 
processes.  Thus  at  this  level  Cm*  became  a  fragmented  memory  machine  once  more. 

Taking  the  attitude  that  a  machine  architecture  is  characterized  by  its  native  instruc¬ 
tion  set,  we  should  call  Cm*  a  shared  memory  machine.  A  litmus  test  for  a  fragmented 
memory  machine  could  be  the  existence  of  distinct  send  and  receive  instructions  for  data 
sharing  in  the  processor  instruction  set.  The  Cm*  is  an  example  of  shared  memory 
machines  with  non-uniform  memory  access  time,  sometimes  called  NUMA  machines.  If 
access  to  a  processor’s  local  memory  took  one  time  unit,  then  access  via  the  cluster  bus 
required  about  three  units  and  access  to  memory  in  another  cluster  took  about  20  units. 
Writing programs under either operating system followed the programming paradigm for
fragmented memory multiprocessors, with explicit send and receive of shared data, but
performance  concerns  favored  large  granularity  cooperation  less  strongly  than  in  a  truly 
fragmented  memory  machine. 

A more recent NUMA shared memory multiprocessor is the BBN Butterfly[4].
References  to  non-local  memory  take  about  three  times  as  long  as  local  references.  The 
Butterfly  processor  to  memory  interconnection  network  also  contradicts  the  naive  charac¬ 
terization  of  shared  memory  switches.  The  network  connecting  N  processors  to  N 
memories is a multistage network with $\log_2 N$ stages, and thus $(N/2)\log_2 N$ individual
links.  It  thus  has  a  potentially  high  concurrency,  although  collisions  are  possible  when 
two  memory  references  require  the  same  link.  Read  and  write  data  are  sent  through  the 
network  as  messages  with  a  self  routing  header  which  establishes  a  circuit  over  which  the 
data  bits  follow.  Messages  are  pipelined  a  few  bits  at  a  time,  and  long  data  packets  of 
many  words  can  use  the  circuit,  once  established.  Thus,  although  single  word  transfers 
are  the  norm,  higher  bandwidths  can  be  achieved  by  packing  data  into  a  multiword 
transmission.  Messages  attempting  to  reference  a  memory  which  is  in  use,  or  colliding 
with others in the switch, fail and are retried by the processor.
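
The way a self routing header steers a message through a multistage network, and how two messages can collide on a link, can be sketched as follows. This is a generic omega network built from 2x2 switches with destination-tag routing, chosen as an assumption for illustration. It is not a description of the Butterfly's actual switch, and the function names are hypothetical.

```python
def network_size(n_procs):
    """Return (stages, (N/2) * log2 N) for N a power of two."""
    stages = n_procs.bit_length() - 1
    assert 1 << stages == n_procs, "N must be a power of two"
    return stages, (n_procs // 2) * stages


def route(src, dst, n_procs):
    """Positions occupied after each stage under destination-tag routing."""
    stages, _ = network_size(n_procs)
    mask = n_procs - 1
    pos, path = src, []
    for stage in range(stages):
        # Perfect shuffle wiring between stages: rotate the position left.
        pos = ((pos << 1) | (pos >> (stages - 1))) & mask
        # The next header bit (most significant first) sets the switch output.
        bit = (dst >> (stages - 1 - stage)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path


def collide(msg_a, msg_b, n_procs):
    """True if two (src, dst) messages need the same output at some stage."""
    path_a = route(*msg_a, n_procs)
    path_b = route(*msg_b, n_procs)
    return any(a == b for a, b in zip(path_a, path_b))
```

For example, `route(1, 5, 8)` ends at position 5 after three stages, and two messages addressed to the same memory necessarily collide at the last stage, which is the case where the text says one of them fails and is retried.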

Finally,  the  naive  view  of  the  difference  between  implicit  synchronization  in  frag¬ 
mented  memory  and  the  need  for  explicit  synchronization  with  shared  memory  should  be 
challenged.  A  shared  memory  synchronization  based  on  data  rather  than  control  struc¬ 
tures  is  that  of  asynchronous  variables.  Asynchronous  variables  have  a  state  as  well  as  a 
value. The state has two values, usually called full and empty, which control access to
the variable by two operations, produce and consume. Produce waits for the state to be
empty, writes the variable with a new value, and sets the state to full. Consume waits for
the state to be full, reads the value, and sets the state to empty. Both are atomic
operations, or in general obey the serialization principle. Void and copy operations are often
supplied to initialize the state to empty, and to wait for full, read and leave full,
respectively. The HEP[5] and Cedar[6] computers supported these operations on memory cells
in  hardware. 
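
The four operations on an asynchronous variable can be sketched in software. The class below is a minimal illustration, assuming Python threads and a `threading.Condition` in place of the hardware full/empty bit. The class name `AsyncVariable` is hypothetical and is not the HEP or Cedar interface.

```python
import threading


class AsyncVariable:
    """A memory cell with a full/empty state (software stand-in)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def void(self):
        """Initialize the state to empty."""
        with self._cond:
            self._full = False
            self._cond.notify_all()

    def produce(self, value):
        """Wait for empty, write the value, set the state to full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cond.notify_all()

    def consume(self):
        """Wait for full, read the value, set the state to empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            self._full = False
            self._cond.notify_all()
            return self._value

    def copy(self):
        """Wait for full, read the value, leave the state full."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            return self._value
```

Holding the condition's lock for the whole of each operation is what makes the wait, access, and state change a single atomic step here. A producer thread calling `produce` repeatedly and a consumer calling `consume` repeatedly then alternate strictly, one value at a time.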

When  data  is  put  in  memory  by  one  processor  using  produce  and  read  by  another 
using consume, the transaction behaves like a one word message from producer to
consumer, with minor differences. The memory cell serves as a one word buffer, and may be
occupied  when  produce  is  attempted.  The  producer  need  not  name  the  consumer; 
instead,  both  name  a  common  item  as  when  send  and  receive  are  linked  to  a  common 
communications  channel  name.  Another  difference  is  that  one  produce  and  multiple 
copys  suffice  to  deliver  the  same  datum  to  multiple  receivers. 


Abstraction  of  Characteristics 

The  essence  of  the  problem  to  be  addressed  by  the  switching  network  in  both  shared 
and  fragmented  memory  multiprocessors  is  the  communication  of  data  from  a  processor 
producing  it  to  one  which  will  use  it.  This  process  can  slow  parallel  computation  when 
either  the  producer  is  delayed  in  transmitting  or  when  the  consumer  is  delayed  in  receiv¬ 
ing.  This  process  can  be  abstracted  in  terms  of  four  characteristics:  initiation  of  transmis¬ 
sion  to  the  data’s  destination,  synchronization  of  production  and  use  of  the  data,  binding 
of  the  data’s  source  to  its  destination,  and  how  transmission  latency  is  dealt  with.  Table 
1 summarizes these characteristics and tabulates them for the traditional views of shared
and fragmented memory multiprocessors.

Characteristics    Fragmented Memory                Shared Memory

Initiation         Producer                         Consumer
Synchronization    Implicit by message existence    Explicit
Binding            Processor name                   Data name
Latency            Masked by early send             Consumer waits

Table 1: Data Sharing in Multiprocessors.

The  initiation  of  data  delivery  to  its  consumer  is  a  key  characteristic  and  influences 
the  others.  Producer  initiated  delivery  characterizes  the  programming  model  of  frag¬ 
mented  memory  multiprocessors.  It  implies  that  the  producer  knows  the  identity  of  the 
consumer,  so  that  binding  by  processor  name  can  be  used,  and  provides  the  possibility  of 
implicit  synchronization  when  the  consumer  is  informed  of  the  arrival  of  data.  If  a  pro¬ 
ducer  in  a  shared  memory  multiprocessor  were  forced  to  write  data  into  an  asynchronous 
variable  in  a  section  of  memory  uniquely  associated  with  the  consumer,  the  programming 
model  would  be  much  the  same  as  for  fragmented  memory.  Consumer  initiated  access  to 
data  assumes  a  binding  where  the  identity  of  the  data  allows  a  determination  of  where  it 
resides. Since the consumer operation is decoupled from the data’s writing by its
producer, explicit synchronization is needed to guarantee validity. One can imagine a
fragmented memory system in which part of a data item’s address specifies its producer and a
sharing  protocol  in  which  the  consumer  sends  a  request  message  to  the  owner  (producer) 
of  a  required  operand.  An  interrupt  could  cause  the  owner  to  satisfy  the  consumer’s 
request,  yielding  a  consumer  initiated  data  transmission.  Such  a  fragmented  memory 
system  would  be  programmed  like  a  shared  memory  machine.  Binding  is  by  data  name, 
and  the  consumer  has  no  implicit  way  of  knowing  the  data  it  requests  has  been  written 
yet,  so  explicit  synchronization  is  required. 

Too  many  explicit  synchronization  mechanisms  are  possible  to  attempt  a  complete 
treatment,  and  sufficient  characterization  for  our  purposes  has  already  been  given.  Since 
message delivery is less often thought of in terms of synchronization, Table 2
summarizes the types of synchronization associated with message delivery. Different requirements
are  placed  on  the  operating  or  run  time  system  and  different  precedence  constraints  are 
imposed  by  the  possible  combinations  of  blocking  and  non-blocking  send  and  receive 
operations. 
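
The four combinations of blocking and non-blocking send and receive can be approximated with a buffered queue. The sketch below assumes Python's `queue.Queue` as the message buffer and uses its `join`/`task_done` pair to approximate a blocking send. The `Channel` class and its method names are hypothetical, and the sketch ignores termination detection.

```python
import queue
import threading


class Channel:
    """A buffered channel offering blocking and non-blocking operations."""

    def __init__(self):
        self._q = queue.Queue()   # unbounded: provides message buffering

    def send_nonblocking(self, msg):
        """Deposit the message in the buffer and return at once."""
        self._q.put(msg)

    def send_blocking(self, msg):
        """Return only after all queued messages have been received."""
        self._q.put(msg)
        self._q.join()

    def receive_nonblocking(self):
        """Return (True, msg), or (False, None) as the fail return."""
        try:
            msg = self._q.get_nowait()
        except queue.Empty:
            return False, None
        self._q.task_done()
        return True, msg

    def receive_blocking(self):
        """Wait until a message is available and return it."""
        msg = self._q.get()
        self._q.task_done()
        return msg
```

With `send_blocking` paired with `receive_blocking`, actions before the exchange in either thread precede the actions after it in both, which is the rendezvous case. With both operations non-blocking, nothing is ordered unless `receive_nonblocking` actually returns a message.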

Types  of  binding  between  producer  and  consumer  in  fragmented  memory  systems 
include:  source/destination  pair,  channel,  and  destination/type.  In  the  case  of 
source/destination,  the  send  operation  names  the  destination  and  receive  names  the 
source.  A  message  can  be  broadcast,  or  sent  to  multiple  receivers,  but  not  received  from 
multiple  sources.  Source  thus  designates  a  single  processor  while  destination  might 
specify  one  or  more.  Message  delivery  can  also  be  through  a  "channel"  or  mailbox.  In 
this  case  send  and  receive  are  connected  because  both  specify  the  same  channel.  A  chan¬ 
nel  holds  a  sequence  of  messages,  limited  by  the  channel  capacity.  To  facilitate  a 
receiver  handling  messages  from  several  sources,  a  sender  can  specify  a  "type"  for  the 
message  and  the  receiver  ask  for  the  next  message  of  that  type.  The  source  is  then  not 
explicitly  specified  by  the  receiver  but  may  be  supplied  to  it  as  part  of  the  message. 

Message Synchronization    System Requirements         Precedence Constraints

Send: nonblocking          Message buffering           None, unless message is
Receive: nonblocking       Fail return from receive    received successfully

Send: nonblocking          Message buffering           Actions before send precede
Receive: blocking          Termination detection       those after receive

Send: blocking             Termination detection       Actions before receive precede
Receive: nonblocking       Fail return from receive    those after send

Send: blocking             Termination detection       Actions before rendezvous
Receive: blocking          Termination detection       precede ones after it
                                                       in both processes.

Table 2: Summary of the types of message synchronization.

Binding  in  shared  memory  is  normally  by  data  location,  but  note  that  the  Linda[7]  shared 
tuple memory uses content addressability, which is somewhat like the "type" binding just
mentioned. 
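
Binding by destination and type can be illustrated with a small mailbox in which the receiver asks for the next message of a given type and learns the source from the message itself. This is a single-threaded sketch with hypothetical names (`TypedMailbox`, `send`, `receive`), not the interface of any system cited here.

```python
from collections import defaultdict, deque


class TypedMailbox:
    """Per-destination mailbox that delivers messages by type."""

    def __init__(self):
        # One FIFO of (source, payload) pairs for each message type.
        self._by_type = defaultdict(deque)

    def send(self, source, msg_type, payload):
        """Called on behalf of a sender; the source travels with the message."""
        self._by_type[msg_type].append((source, payload))

    def receive(self, msg_type):
        """Return (source, payload) of the oldest message of this type, or None."""
        pending = self._by_type[msg_type]
        return pending.popleft() if pending else None
```

The receiver never names a source. It selects only on type, so messages from several senders can be handled in arrival order within each type.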

The problem of latency in sharing data and how it is dealt with is the most
important issue in the performance of multiprocessors. At the lowest level it is tied up with the
latency and concurrency of the switch. Two slightly different concepts should be
distinguished. If $T_s$ is the time at which a send is issued in a message passing system and
$T_r$ is the time at which the corresponding receive returns data, then the latency is
$T_L = T_r - T_s$. The transmission time for messages often has an initial startup overhead
and a time per unit of information in the message, of the form $t_i + k t_u$, where $k$ is the
number of units transmitted. The startup time $t_i$ is less than $T_L$, but is otherwise
unrelated. In particular, if $T_L$ is large, several messages can be sent before the first one is
received. The granularity of data sharing is determined by the relationship of $t_i$ to $t_u$. If
$t_i \gg t_u$ good performance dictates $k \gg 1$, making the granularity coarse. If $t_i \approx t_u$ the
fine granularity case of $k = 1$ suffers little performance degradation. Read and write in a
shared memory switch must at least have small $t_i$ so that data transmissions with small $k$
perform well. A fine granularity switch with small startup $t_i$ may still have a large
latency $T_L$, and this is the concern of the fourth characteristic in Table 1.
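
The effect of startup overhead on granularity can be made concrete with a small calculation. The sketch below assumes the linear cost model from the text, a startup time `t_i` plus `k` units at `t_u` each. The function names and the sample numbers are illustrative assumptions, not measurements.

```python
def transmission_time(k, t_i, t_u):
    """Time to transmit k units: startup overhead plus per-unit time."""
    return t_i + k * t_u


def efficiency(k, t_i, t_u):
    """Fraction of the transmission time spent moving data, not starting up."""
    return (k * t_u) / transmission_time(k, t_i, t_u)
```

With a startup cost one hundred times the per-unit time, `efficiency(1, 100, 1)` is below 1 percent while `efficiency(1000, 100, 1)` exceeds 90 percent, so long messages are needed. With comparable costs, `efficiency(1, 1, 1)` is already one half, and single word transfers lose comparatively little.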

Latency  must  grow  with  the  number  of  processors  in  a  system,  if  only  because  its 
physical  size  grows  and  signal  transmission  is  limited  by  the  speed  of  light.  As  the  sys¬ 
tem  size  grows,  the  key  question  is  how  the  inevitable  latency  is  dealt  with.  An  architec¬ 
ture  in  which  latency  does  not  slow  down  individual  processors  as  the  number  of  them 
increases  is  called  scalable.  Scalability  is  a  function  both  of  how  latency  grows  and  how 
it  is  managed.  Message  latency  can  be  masked  by  overlapping  it  with  useful  computa¬ 
tion.  Figure  5  shows  a  send/receive  transaction  in  a  fragmented  memory  system.  In  part 
a)  message  latency  is  successfully  overlapped  by  computation  in  the  consumer  whereas 
in  part  b)  the  consumer  does  not  have  enough  to  do  before  needing  the  data  in  order  to 
completely mask the latency. In reference to Fig. 5, scalability is a function of how the
program doing the sends and receives is organized. The ratio of available overlapping
computation to message latency decreases as system size grows, both because latency
grows and because computation is more finely divided.

Figure 5: Masking message latency by computation. a) Message latency well masked by
computation. b) Poorly masked message latency.
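
The shrinking ratio of overlapping computation to latency can be illustrated numerically. The sketch assumes, purely for illustration, that a fixed amount of work divides evenly over the processors and that latency grows as the base 2 logarithm of their number. Both assumptions and all names are hypothetical.

```python
import math


def consumer_stall(latency, overlap_work):
    """Time the consumer idles: latency not covered by other computation."""
    return max(0.0, latency - overlap_work)


def overlap_ratio(total_work, base_latency, procs):
    """Computation available per processor divided by message latency.

    Assumes evenly divided work and latency = base_latency * log2(procs),
    so procs must be at least 2.
    """
    work_per_proc = total_work / procs
    latency = base_latency * math.log2(procs)
    return work_per_proc / latency
```

`consumer_stall(5.0, 8.0)` is zero, the well masked case, while `consumer_stall(8.0, 5.0)` leaves three time units exposed. Under the stated assumptions `overlap_ratio` falls steadily as `procs` grows, since its numerator shrinks while its denominator grows.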

In  shared  memory  multiprocessors  the  consumer  initiation  of  access  to  data  when 
needed  eliminates  the  possibility  of  arranging  the  program  so  that  sends  occur  early 
enough  to  mask  latency.  Latency  can  be  managed  in  this  case,  as  in  the  other,  by  reduc¬ 
ing  it  or  by  masking  it  with  useful  computation.  Latency  reduction  in  the  shared  memory 
hardware  regime  is  done  by  caching  and  latency  masking  by  pipelining  or  multiprogram¬ 
ming.  In  the  naive  view,  scalability  is  a  hardware  concern  in  shared  memory  but  more  a 
function  of  program  structure  in  fragmented  memory,  leading  to  the  notion  of  software 
scalability. Assuming infinitely fast transmission, networks with $P$ processors and a
reasonable number of switching nodes usually have latency on the order of $\log_m P$, where $m$
is the number of input and output ports per switch node. If finite speed of signal
transmission is an issue, latency is proportional to the cube root of $P$ for a system
buildable in three dimensional space and to the square root of $P$ if messages occupy volume.
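
The two growth laws can be compared with a short calculation. The sketch below counts switch stages with integer arithmetic and treats the cube root bound as a bare proportionality with constant one. The function names are hypothetical and the constants are assumptions.

```python
def switch_stages(procs, ports):
    """Smallest number of stages s with ports**s >= procs."""
    stages, reach = 0, 1
    while reach < procs:
        reach *= ports
        stages += 1
    return stages


def wire_delay_bound(procs):
    """Cube root of P: side length of a cube holding P unit-volume nodes."""
    return procs ** (1.0 / 3.0)
```

For 64 processors, `switch_stages(64, 2)` is 6 and `switch_stages(64, 4)` is 3, so wider switch nodes shorten the path. The cube root term grows faster than any logarithm, so for sufficiently large systems the physical distance dominates.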

Concurrency of the switch also has an influence on latency. It must clearly have a
concurrency much greater than one for any multiprocessor with more than a very few
processors. Using a single bus for this switch is inadmissible in all but the smallest of
systems. For scalability, concurrency should grow linearly with the number of
processors; otherwise the lack of physical network paths will lead to long latencies when
many processors use the switch simultaneously. Even with order $P$ links, collisions
between messages can occur under unfavorable access patterns. The way to control
collisions is a function of granularity. In a fine granularity network, randomization which
distributes the small transactions uniformly over the network is usually appropriate.
With large granularity transactions, randomization is less effective, and scheduling of the
transactions may be required.

Thus the abstract differences between shared and fragmented memory
multiprocessors rest on the four characteristics of Table 1, with the selection of producer or
consumer initiation of data delivery having a strong influence on the other three. Consumer
initiation  is  naively  associated  with  explicit  synchronization,  data  name  binding,  and 
latency  reduction.  Producer  initiation  suggests  implicit  synchronization,  processor  name 
binding,  and  latency  tolerance  by  executing  sends  early. 


Convergence

The  direction  of  current  developments  in  shared  and  fragmented  memory  multipro¬ 
cessors  is  generally  toward  convergence.  The  desire  to  write  programs  with  a  shared 
name  space  for  fragmented  memory  machines  is  supported  both  by  research  on  virtual 
shared  memory  using  paging  techniques  and  by  automatic  compiled  or  preprocessed  gen¬ 
eration  of  sends  and  receives  for  remote  data  references.  Multiprogramming  the  nodes  of 
a  fragmented  memory  multiprocessor  can  also  increase  the  amount  of  computation  avail¬ 
able  to  mask  latency.  Virtual  processors  make  use  of  the  idea  of  parallel  slackness,  or 
the use of some of a problem’s inherent parallelism to control latency. In shared memory
multiprocessors,  considerable  work  is  being  applied  to  multiprocessor  caching,  which 
distributes  shared  data  among  processors  to  reduce  latency.  Hardware  cache  manage¬ 
ment,  software  cachability  analysis,  and  correct  placement  and  copying  in  NUMA 
machines  have  been  considered.  Much  attention  has  been  given  to  fast,  packet  switched, 
multistage  interconnection  networks  for  use  in  the  processor  to  memory  interface,  and 
pipelining  techniques  have  been  applied  to  tolerate  the  inevitably  large  latency  of  such 
networks connecting many processors.

Support  for  a  shared  name  space  on  fragmented  memory  multiprocessors  takes 
several forms. Li[8] has considered using paging techniques to produce a shared
memory address space on a fragmented machine. If the paging is heavily supported by
hardware, convergence is easily seen between this work and the work on multiprocessor
caching exemplified by [9]. Another approach uses program analysis to automatically
generate the sends and receives required to move data from its producer to its consumer.
For regular access patterns, the user can specify data mapping and a language like
DINO[10] can generate message transmissions to satisfy non-local references. When
regular access patterns are generated by loops in automatic parallelization of a sequential
program[11], the more constrained structure allows even more of the mapping and data
movement  to  be  generated  automatically  by  the  compiler. 

Automating  data  mapping  across  distributed  memories  has  a  long  history  and  might 
be typified by the work of Berman and Snyder[12]. If access patterns are data dependent,
as in computations on machine generated grids, they may still be constant over long
periods. It may then be beneficial to bind addresses and generate data movement using a
preprocessor[13] which acts at run-time, after data affecting addresses is known, but
before  the  bulk  of  the  computation,  which  is  often  iterative,  is  carried  out.  Preprocessor 
work  can  thus  be  amortized  over  many  iterations  with  the  same  access  pattern.  Conver¬ 
gent  work  for  shared  memory  has  taken  place  in  connection  with  NUMA  architectures. 
The  BBN  Butterfly  provides  support  for  placement  and  copying  to  reduce  the  penalty  for 
long  memory  references.  Software  places  private  data  in  the  local  memory  of  its  proces¬ 
sor  and  randomizes  references  to  structures  such  as  arrays  over  memory  modules  to  avoid 
memory  "hot  spots"[14]. 
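
The run-time preprocessing idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of [13]: the block distribution, the names BLOCK, owner_of, offset_of, build_schedule, gather and iterate, and the list-of-lists stand-in for node memories are all hypothetical. The schedule of remote references is built once, after the index list is known, and then reused in every iteration.

```python
BLOCK = 4  # hypothetical block size: global element i lives on node i // BLOCK


def owner_of(i):
    """Node whose local memory holds global element i (assumed block mapping)."""
    return i // BLOCK


def offset_of(i):
    """Position of global element i inside its owner's local memory."""
    return i % BLOCK


def build_schedule(indices, me):
    """Preprocessing step: run once, after the index list is known.
    Groups the distinct remote indices by the node that owns them."""
    schedule = {}
    for i in sorted(set(indices)):
        p = owner_of(i)
        if p != me:
            schedule.setdefault(p, []).append(i)
    return schedule


def gather(schedule, memories):
    """Per-iteration step: fetch every scheduled remote value.
    Each schedule entry stands for one message from one owner."""
    ghost = {}
    for p, wanted in schedule.items():
        for i in wanted:
            ghost[i] = memories[p][offset_of(i)]
    return ghost


def iterate(indices, me, memories, steps):
    """Sum x[i] over the index list 'steps' times, reusing one schedule."""
    schedule = build_schedule(indices, me)  # cost amortized over all steps
    totals = []
    for _ in range(steps):
        ghost = gather(schedule, memories)
        total = 0
        for i in indices:
            if owner_of(i) == me:
                total += memories[me][offset_of(i)]
            else:
                total += ghost[i]
        totals.append(total)
    return schedule, totals


# Four nodes; global element i holds the value 10 * i.
memories = [[10 * (n * BLOCK + k) for k in range(BLOCK)] for n in range(4)]
schedule, totals = iterate([1, 5, 6, 13, 5, 2], me=0, memories=memories, steps=3)
```

Here node 0 needs elements 5 and 6 from node 1 and element 13 from node 3, so each iteration costs two messages no matter how many times an index is repeated in the list.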

Finally,  convergence  in  latency  hiding  techniques  is  seen  between  the  use  of  virtual 
processors  in  fragmented  memory  and  pipelining  in  shared  memory  multiprocessors.  If 
we  attempt  to  use  consumer  initiation  in  fragmented  memory  by  interrupting  the  owner 
of  a  datum  with  a  request  for  transmission,  we  see  a  behavior  like  that  of  Fig.  6  a).  In 
order  to  make  use  of  the  long  wait  resulting  from  consumer  initiation  of  the  delivery,  the 
processor  executing  the  consumer  process  can  be  switched  to  another  process,  as  shown 
in  Fig.  6  b).  If  the  process  is  associated  with  a  different  program,  we  have  the  standard 
technique  of  masking  latency  by  multiprogramming,  which  is  used  in  masking  disk 
latency in virtual memory systems. If the extra process is associated with the same parallel
program, we have a partly time multiplexed form of multiprocessing often
characterized by the name virtual processors. The use of virtual processors to enhance
performance has recently been most frequently discussed in relation to an SIMD architecture,
the Connection Machine[15], where it is important for masking latency arising
from several different sources. If each processor of a fragmented memory multiprocessor
time multiplexes several processes so that message latency in the communication network
is overlapped with useful computation, a time snapshot of message traffic and processor
activity might appear as in Fig. 7.

Figure 6: Consumer initiated transmission in a fragmented memory system.

An  early  use  of  multiprocessing  to  mask  memory  latency,  as  opposed  to  I/O  latency, 
was in the peripheral processors of the CDC 6600[16]. Ten slow peripheral processor
memories  were  accommodated  by  time  multiplexing  ten  virtual  processors  on  a  single  set 
of  fast  processor  logic.  Process  contexts  were  switched  on  a  minor  cycle  basis.  Later, 
the  Denelcor  HEP  used  fine  grained  multiprocessing  to  mask  latency  in  a  shared,  pipe¬ 
lined  data  memory.  The  concept  of  pipelined  multiprocessing  is  illustrated  in  Fig.  8. 
Round  robin  issuing  of  a  set  of  process  states  into  the  unified  pipeline  is  done  on  a  minor 
cycle  basis.  Processes  making  memory  references  are  queued  separately  to  be  returned 
to  the  execution  queue  when  satisfied.  Pipeline  interlocks  are  largely  unnecessary  since 
instructions  which  occupy  the  pipeline  simultaneously  come  from  different  processes  and 
can  only  depend  on  each  other  through  explicit  synchronization. 
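
The round robin issue discipline can be illustrated with a small simulation. It is a sketch under assumed parameters and not a model of the HEP itself: the function name simulate, the pipeline depth, the memory latency and the fixed instruction mix (a memory reference after every alu_per_ref register instructions) are all hypothetical. One instruction issues per cycle from the head of the ready queue, and a process rejoins the queue only when its previous instruction has completed.

```python
import heapq
from collections import deque


def simulate(n_procs, pipe_depth, mem_latency, alu_per_ref, cycles):
    """Return the fraction of cycles in which an instruction was issued."""
    ready = deque(range(n_procs))  # processes eligible to issue, FIFO order
    waiting = []                   # heap of (wake_cycle, pid) for busy processes
    count = [0] * n_procs          # instructions issued so far by each process
    issued = 0
    for t in range(cycles):
        # Processes whose instruction or memory reference has completed
        # return to the tail of the ready queue.
        while waiting and waiting[0][0] <= t:
            ready.append(heapq.heappop(waiting)[1])
        if not ready:
            continue  # pipeline bubble: nothing to issue this cycle
        pid = ready.popleft()
        issued += 1
        count[pid] += 1
        # Every (alu_per_ref + 1)-th instruction is a memory reference,
        # which is queued separately for mem_latency cycles.
        is_mem = count[pid] % (alu_per_ref + 1) == 0
        delay = mem_latency if is_mem else pipe_depth
        heapq.heappush(waiting, (t + delay, pid))
    return issued / cycles


one = simulate(n_procs=1, pipe_depth=8, mem_latency=8, alu_per_ref=3, cycles=80)
full = simulate(n_procs=8, pipe_depth=8, mem_latency=8, alu_per_ref=3, cycles=80)
```

With a single process the pipeline issues once every eight cycles. With eight processes and no extra memory delay, every cycle issues an instruction. Raising mem_latency above pipe_depth then requires still more processes to keep the pipeline full.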

For  latency  to  be  masked  by  satisfying  requests  at  a  higher  rate  than  processor- 
memory  latency  would  imply,  many  requests  must  be  in  progress  simultaneously.  This 
implies  a  pipelined  switch  between  processors  and  memory,  and  possibly  pipelining  the 
memory  also.  Pipelining  and  single  word  access  together  imply  a  low  overhead,  message 
switched  network.  Variable  traffic  in  the  shared  network  requires  a  completion  report  for 
each  transaction,  regardless  of  whether  it  is  a  read  or  write.  Whether  the  memory 
modules  themselves  are  pipelined  or  not  depends  on  the  ratio  of  the  module  response 
time to the step time of the pipelined switch. If the memory module responds
significantly more slowly than the switching network can deliver requests, memory mapping
and address decoding are obvious places to use pipelining within the memory itself. Figure 9,
which bears an intentional resemblance to Fig. 7, shows an activity snapshot in a
system built of multiple pipelined multiprocessors which mask the latency of multiple
read and write operations in the processor to memory switch.

Figure 7: Masking Message Transmission with Multiprogramming.

Figure 8: One execution unit of a pipelined multiprocessor.

Figure 9: Pipelined Multiprocessors in a Shared Memory Multiprocessor System.
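
The requirement that many requests be in progress at once can be made quantitative with Little's law, which is brought in here as a standard result and is not cited in the text. The sketch below makes two simplifying assumptions: a steady request rate and a fixed round trip latency. The function names and the numbers are hypothetical.

```python
import math


def requests_in_flight(rate, latency):
    """Little's law: mean number of outstanding requests equals the
    completion rate (requests per cycle) times the latency (cycles)."""
    return rate * latency


def module_pipeline_stages(module_response, switch_step):
    """Stages a memory module would need in order to accept one request
    per switch step; 1 means the module need not be pipelined."""
    return max(1, math.ceil(module_response / switch_step))


# One reference per cycle through a 40 cycle round trip.
outstanding = requests_in_flight(rate=1.0, latency=40)
# A module four times slower than the switch step time.
stages = module_pipeline_stages(module_response=4, switch_step=1)
```

Under these assumptions, sustaining one reference per cycle across a 40 cycle round trip requires 40 requests in the switch and memory at once, and a module four times slower than the switch step needs about four pipeline stages to keep up.
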

Convergence can also be seen in switching network research. Packet switched processor
to memory interconnections such as that in the NYU Ultracomputer[17] bear a
strong  resemblance  to  communication  networks  used  in  message  passing  distributed 
memory  computers.  Previously,  the  store  and  forward  style  of  contention  resolution  was 
only  seen  in  communications  networks  carrying  information  packets  much  larger  than 
one  memory  word.  There  is  also  a  strong  resemblance  between  the  "cut-through  rout¬ 
ing"  [18]  recently  introduced  in  fragmented  memory  multiprocessors  and  the  previously 
mentioned  header  switched  connections  made  by  messages  in  the  BBN  Butterfly  shared 
memory  switch. 


Conclusions 

The  question  of  what  one  concludes  from  all  this  is  really  a  question  of  what  one  is 
led  to  predict  for  the  future  of  multiprocessors.  The  predictions  can  be  formulated  as  the 
answers  to  three  questions:  What  will  be  the  programming  model  and  style  for  multipro¬ 
cessors?  How  will  the  system  architecture  support  this  model  of  computation?  What 
will  be  the  split  between  hardware  and  software  in  contributing  to  this  system  architec¬ 
ture? 

The  programmer  will  surely  reference  a  global  name  space.  This  feature 
corresponds  too  closely  to  the  way  we  formulate  problems,  and  too  much  progress  has 
been  made  toward  supporting  it  on  widely  different  multiprocessor  architectures,  for  us 
to  give  it  up.  It  also  seems  that  most  synchronization  will  be  data  based  rather  than  con¬ 
trol  based.  Associating  the  synchronization  with  the  objects  whose  consistency  it  is  sup¬ 
posed  to  preserve  is  more  direct  and  less  error  prone  than  associating  it  with  the  control 
flow  of  one  or  more  processes.  Programs  will  have  more  parallelism  than  the  number  of 
physical  processors  in  the  multiprocessor  expected  to  run  them,  with  the  extra  parallelism 
being  used  to  mask  latency. 

Multiprocessor  architecture  will  consist  of  many  processors  connected  to  many 
memories.  A  portion  of  the  memory  will  be  globally  interconnected  by  way  of  a  high 
concurrency switch. The switch will have a latency which scales as $\log_m P$ for moderate
speed systems, with $m$ probably greater than two. For the highest speed systems, the
latency will scale as $P^{1/2}$. Multiprocessors will use a Harvard architecture, separating the
program  memory  from  data  memory  to  take  advantage  of  its  very  different  access  pat¬ 
terns.  Data  memory  private  to  each  processor  will  be  used  to  store  the  stack,  other  pro¬ 
cess  private  data  and  copies  of  read  only  shared  data.  Only  truly  shared  data  will  reside 
in  the  shared  memory. 
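
To give a feel for the two latency scalings mentioned above, the following sketch tabulates a logarithmic law and a square root law for a few machine sizes. It assumes unit constants of proportionality and reads the second law as the square root of P; both are assumptions of this illustration, as are the function names.

```python
import math


def log_law_stages(p, m):
    """Smallest s with m**s >= p: the number of stages in a multistage
    network of m-by-m switches reaching p ports."""
    stages, reach = 0, 1
    while reach < p:
        reach *= m
        stages += 1
    return stages


def sqrt_law_steps(p):
    """Ceiling of the square root of p, computed in exact integer arithmetic."""
    return math.isqrt(p - 1) + 1


# (P, stages with m=2, stages with m=4, square root steps)
table = [(p, log_law_stages(p, 2), log_law_stages(p, 4), sqrt_law_steps(p))
         for p in (16, 256, 1024, 4096)]
```

For 1024 processors this gives 10 stages with two-way switches, 5 stages with four-way switches, and 32 steps under the square root law. The comparison says nothing about the time per step, which is where the highest speed systems would have to recover the difference.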

A  combination  of  software  and  hardware  techniques  will  be  used  to  mask  the 
latency  inherent  in  data  sharing.  Compiler  analysis  will  be  the  main  mechanism  for 
determining  what  data  is  truly  shared.  It  may  even  generate  code  to  dynamically  migrate 
data  into  private  memories  for  a  long  program  section  during  which  it  is  not  shared.  The 
hardware  will  time  multiplex  (pipeline)  multiple  processes  on  each  processor  at  a  very 
fine  granularity  in  order  to  support  latency  masking  by  multiplexed  computation.  Some 
of  the  multiprocessor  cache  research  may  find  use  in  partially  supporting  the  data  migra¬ 
tion  with  hardware,  but  a  knowledge  of  reference  patterns  is  so  important  to  data  sharing 
that  it  is  unlikely  that  the  hardware  will  forego  the  increasingly  effective  assistance  avail¬ 
able  from  the  compiler. 

In short, the hardware of multiprocessor systems, assisted by the compiler, can do
much  more  than  we  currently  ask  of  it.  Moving  software  mechanisms  into  hardware  pro¬ 
duces  a  significant  performance  gain,  and  should  be  done  when  a  mechanism  is  well 
understood,  proven  effective  and  of  reasonably  low  complexity.  Finally,  although 
automatic  parallelization  has  been  poorly  treated  in  this  paper,  it  is  perhaps  possible  to 
say  that,  in  spite  of  the  excellent  work  done  in  turning  sequential  programs  into  parallel 
ones,  a  user  should  not  take  great  pains  in  a  new  program  to  artificially  sequentialize 
naturally  parallel  operations  so  that  they  can  be  done  on  a  computer. 